Learning to rank Higgs boson candidates

In the extensive search for new physics, the precise measurement of the Higgs boson continues to play an important role. To this end, machine learning techniques have recently been applied to processes such as Higgs production via vector-boson fusion. In this paper, we propose to use learning-to-rank algorithms for this task, i.e., algorithms that sort events into an order with signal first and background second, instead of algorithms that classify events into two classes. Because training is then performed on pairwise comparisons of signal and background events, the amount of training data is effectively increased through the quadratic number of possible combinations. This makes the approach robust in unbalanced data set scenarios and can improve the overall performance compared to pointwise models such as the state-of-the-art boosted decision tree approach. In this work we compare our pairwise neural network algorithm, a combination of a convolutional neural network and the DirectRanker, with convolutional neural networks, multilayer perceptrons, and boosted decision trees, which are commonly used algorithms in multiple Higgs production channels. Furthermore, we use so-called transfer learning techniques to improve overall performance on different data types.

The DirectRanker is a pairwise ranking approach. Pairwise ranking algorithms take two instances and decide which of them is more relevant. To achieve a consistent ranking, the DirectRanker implements a total quasiorder on the feature space F through its ranking function r : F × F → R with x ⪰ y ⇔ r(x, y) ≥ 0. This function satisfies the following properties:
• Reflexivity: r(x, x) = 0
• Antisymmetry: r(x, y) = −r(y, x)
• Transitivity: (r(x, y) ≥ 0 ∧ r(y, z) ≥ 0) ⇒ r(x, z) ≥ 0
To implement such a function, the model is divided into two parts. The first part, the feature part, consists of fully connected layers and learns a deep representation of the input instances. In Figure 3, this part is built out of two subnetworks nn1 and nn2 that share their parameters. As long as the two subnetworks yield the same output for the same input, they can be made of any kind of layers. The representations extracted by the two subnetworks nn1 and nn2 are first subtracted and then fed into the ranking part of the DirectRanker. This second part of the model consists of a single neuron with a sign-conserving activation and no bias; in Figure 3, it is marked as o1. Under these conditions, the authors have proven that the DirectRanker 1 is able to implement a total quasiorder.
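This construction can be sketched in a few lines of TensorFlow/Keras code. The following is a minimal sketch under assumed layer sizes; the Dense widths and the tanh activation are illustrative choices, not the authors' exact settings:

```python
# Minimal sketch of a DirectRanker-style pairwise model (illustrative sizes).
import tensorflow as tf

n_features = 20  # hypothetical number of input features per event

# Shared feature part: one network applied to both inputs (nn1 and nn2 share weights).
feature_part = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
])

x1 = tf.keras.Input(shape=(n_features,), name="event_1")
x2 = tf.keras.Input(shape=(n_features,), name="event_2")

# Antisymmetry comes from subtracting the two representations ...
diff = tf.keras.layers.Subtract()([feature_part(x1), feature_part(x2)])

# ... and from the ranking part o1: a single neuron with no bias and a
# sign-conserving (odd, monotone) activation such as tanh.
o1 = tf.keras.layers.Dense(1, activation="tanh", use_bias=False)(diff)

ranker = tf.keras.Model(inputs=[x1, x2], outputs=o1)
ranker.compile(optimizer="adam", loss="mse")  # L2-type cost on pair labels in (-1, 1)
```

With this structure, r(x, x) = 0 follows because the subtracted representation vanishes and the output neuron has no bias, and r(x, y) = −r(y, x) follows from the oddness of the activation.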
To evaluate the best hyperparameters, the authors of the DirectRanker explored multiple model architectures on common benchmark ranking tasks. For the experiments in this work, we extended the explored parameter space. All parameters used are shown in Table 1, with the overall best indicated in boldface. As observed in the original paper 1, we also found that larger models perform better with dropout or weight regularization, while a smaller network achieves the same performance with a shorter overall training time. Since the original DirectRanker was implemented in TensorFlow v1.15.0, we re-implemented it in TensorFlow v2.5.0 2. This implementation allowed us to train the model over several epochs, using the whole data set multiple times, whereas the initial version was trained for one epoch only. We also added the option of early stopping with a look-back parameter, which stops the training if the performance on a separate validation set does not increase. Besides the L2 cost function, we evaluated the performance using a cross-entropy cost. Furthermore, in addition to the standard Adam optimizer 5 used in the original work, we explored Nadam 3, an Adam variant using Nesterov momentum, and stochastic gradient descent (SGD) 4. Neither of these changes led to a significant increase in performance.
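As a rough illustration of how such epoch-based pairwise training with a look-back could be set up, the following sketch assumes the `ranker` model from above and two hypothetical arrays `signal` and `background` of shape (n_events, n_features); the pair-sampling scheme shown here is our own simplification, not necessarily the authors' exact procedure:

```python
# Sketch of pairwise training with validation-based early stopping.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

def sample_pairs(signal, background, n_pairs):
    """Draw random (signal, background) pairs; label +1 when the first input is signal."""
    s = signal[rng.integers(len(signal), size=n_pairs)]
    b = background[rng.integers(len(background), size=n_pairs)]
    flip = rng.random(n_pairs) < 0.5          # present half the pairs in reversed order
    x1 = np.where(flip[:, None], b, s)
    x2 = np.where(flip[:, None], s, b)
    y = np.where(flip, -1.0, 1.0)
    return [x1, x2], y[:, None]

x_train, y_train = sample_pairs(signal, background, n_pairs=100_000)
x_val, y_val = sample_pairs(signal, background, n_pairs=10_000)

# The "look-back" corresponds to the patience argument of Keras early stopping.
stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                        restore_best_weights=True)
ranker.fit(x_train, y_train, validation_data=(x_val, y_val),
           epochs=10, batch_size=200, callbacks=[stop])
```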

Convolutional Neural Network
The motivation why convolutional neural networks (CNNs) 6 should work well for high-energy physics analyses is that they are to some degree invariant to shifting, scaling, and distortion of the input data. Typically this is effective for image, sound, or text analysis, where the order of the data contains some information. Since the data collected in high-energy physics experiments is similar to image data, these properties of CNNs can be beneficial. For many analyses in high-energy physics experiments, however, reconstructed data is used, where the image-like data is converted into actual properties of physical particles. Even in this case, properties of the same particle are frequently placed next to each other in the reconstructed data. Therefore, several properties of one particle can be combined by a CNN, which bears some potential to improve the classification result. In this work, we always reconstruct 3 particles or jets, whose values, such as transverse momentum or energy, are located next to each other in the data. This motivated the use of CNNs with kernels spanning 3 values, able to combine the 3 values of the transverse momentum, and so forth.

Supplementary Table 2. Overview of the hyperparameters used for training the convolutional network. The best overall performance is indicated in boldface. The permutation layer represents a fully connected layer with the same number of neurons as the input size. The task of this layer is to reorder the input features before they are fed into the convolutional layers. (Optimizer: Adam, Nadam, SGD; epochs: 10, 50, 100; validation size: 0.1; early stopping look-back: 3, 6, none.)
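A minimal sketch of this idea, under an assumed feature layout, is the following: for each kinematic property, the values of the three reconstructed objects are adjacent, e.g. [pt_1, pt_2, pt_3, eta_1, eta_2, eta_3, E_1, E_2, E_3, ...], and a 1D convolution with kernel size 3 and stride 3 then combines the three values of one property in a single filter application. The sizes below are illustrative, not the exact configuration used in the paper:

```python
# Sketch: kernel-size-3 convolution over per-property groups of three objects.
import tensorflow as tf

n_properties = 5                                    # hypothetical properties per object
inputs = tf.keras.Input(shape=(3 * n_properties, 1))
combined = tf.keras.layers.Conv1D(filters=64, kernel_size=3, strides=3,
                                  activation="relu")(inputs)
extractor = tf.keras.Model(inputs, combined)        # output shape: (batch, n_properties, 64)
```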
In Table 2, the hyperparameters evaluated for the various CNN models used in this work are shown. CNNs are typically composed of multiple convolutional layers with different filter sizes. On top of these convolutional layers, fully connected layers are stacked to further process the extracted information. For the convolutional layers, the best choice was one or two layers with 64 filters. The best activation function for the convolutional layers was an S-shaped Rectified Linear Unit (SRelu) 7, which is able to learn both convex and non-convex functions. The best-performing hyperparameters for the fully connected layers were 50 neurons and a Relu activation. For larger numbers of layers, weight regularization performed better, while for fewer layers no weight regularization was chosen. The cost function used for this model was cross-entropy. The best batch size was 200 and the best number of epochs was 10, the same values as for the DirectRanker. In addition, the Adam optimizer was employed and the early stopping look-back was set to 3.
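A sketch of a classifier built from these boldface values might look as follows; since SRelu is not a built-in Keras activation, plain Relu is used here as a stand-in, the permutation layer is omitted, and the input size is an assumption:

```python
# Sketch of the best-performing CNN configuration (illustrative stand-ins noted above).
import tensorflow as tf

def build_cnn(n_features):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features, 1)),
        tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
        tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model

# Example usage with hypothetical training arrays:
# cnn = build_cnn(n_features=15)
# cnn.fit(x_train, y_train, batch_size=200, epochs=10, validation_split=0.1,
#         callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])
```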

Combination of Ranking and CNN Layers
For combining the benefits of the CNN and the DirectRanker, we propose a combination of both. We use CNN layers to extract feature combinations, which are then added to the original features before they are fed into the DirectRanker. Figure 3 shows this combination. To make this combination work, a couple of difficulties have to be overcome. First of all, the two networks feeding the ranking part of the DirectRanker need to be identical. To fulfil this, the CNN layers and the subsequent fully connected layer need to share weights and biases. Another problem occurs when combining the extracted features with the original ones, since the extraction will likely transform the features such that they follow a different distribution than the original ones. Therefore, a batch normalization layer 8 was inserted to normalize the extracted features. In addition, the number of extracted features is important: it determines which of the two approaches has more influence on the result. If the last layer of the CNN contains a substantial number of neurons, the overall result depends more on the CNN than on the DirectRanker part. In our hyperparameter optimization, we found that keeping this number moderate to low gives the best results. Training the combination can be done in the same way as the DirectRanker is trained. This leads to the advantage that the number of instances presented to the network increases, since it can sample from all combinations of signal and background. This is in contrast to standard classification, which samples just from either signal or background.

Supplementary Table 3. Overview of the hyperparameters used for training the combination of ranking and CNN layers. The best overall performance is indicated in boldface. The permutation layer represents a fully connected layer with the same number of neurons as the input size. The task of this layer is to reorder the input features before they are fed into the convolutional layers. (Optimizer: Adam, Nadam, SGD; epochs: 10, 50, 100; validation size: 0.1; early stopping look-back: 3, 6, none.)
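The following sketch illustrates the combination under assumed layer sizes: a shared CNN branch extracts a small number of feature combinations, which are normalized and concatenated with the raw inputs before the DirectRanker-style ranking part. All widths and names are illustrative, not the tuned configuration:

```python
# Sketch of the combined CNN + DirectRanker model (illustrative sizes).
import tensorflow as tf

n_features = 20  # hypothetical number of input features per event

# Shared feature part (identical weights for both inputs of a pair).
inp = tf.keras.Input(shape=(n_features,))
x = tf.keras.layers.Reshape((n_features, 1))(inp)
x = tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu")(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(8, activation="relu")(x)    # moderate-to-low number of extracted features
x = tf.keras.layers.LayerNormalization()(x)           # best normalization variant (see below)
merged = tf.keras.layers.Concatenate()([inp, x])      # raw features + extracted combinations
feature_part = tf.keras.Model(inp, merged)

# Pairwise ranking part, as in the plain DirectRanker.
x1 = tf.keras.Input(shape=(n_features,))
x2 = tf.keras.Input(shape=(n_features,))
diff = tf.keras.layers.Subtract()([feature_part(x1), feature_part(x2)])
o1 = tf.keras.layers.Dense(1, activation="tanh", use_bias=False)(diff)
combined = tf.keras.Model([x1, x2], o1)
```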
In Table 3, all hyperparameters of the model are shown. The boldface ones yielded the best performance; as before, regularization such as dropout or weight regularization worked better for larger networks. Overall, the hyperparameters finally chosen for the combination largely coincide with the best ones for the individual models. For the normalization layer, the best hyperparameter was layer normalization 9. In contrast to batch normalization, the mean and variance in layer normalization are calculated from the summed inputs to the neurons of a layer on a single training case.
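A toy example makes this difference explicit; the array below is purely illustrative:

```python
# Batch normalization: statistics per feature across the batch (columns).
# Layer normalization: statistics per example across its features (rows).
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

batch_stats = x.mean(axis=0), x.var(axis=0)  # one mean/variance per feature
layer_stats = x.mean(axis=1), x.var(axis=1)  # one mean/variance per training case
```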

Multilayer Perceptron
The multilayer perceptron (MLP) used in this work consists of an input layer with the same size as the input data, a varying number of hidden layers, and one output layer containing a single neuron. The hyperparameters used for the MLP can be seen in Table 4. The number of fully connected hidden layers was varied from 2 to 8, with 5 to 256 neurons per layer, and the layer sizes were always arranged in descending order. For the activation function of the hidden layers and the output layer, tanh and Relu were used, with tanh performing best in most of the experiments. Once more we observed that larger networks work better when dropout or L2 weight regularization is applied, while smaller ones did not need such regularization and still achieved similar performance. The best batch size, number of epochs, and early stopping look-back were once again 200, 10, and 3, respectively. In addition, the best optimizer was Adam.
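A sketch of such an MLP with descending layer sizes and optional regularization could look as follows; the specific widths, dropout rate, and the sigmoid output used for the cross-entropy classifier are illustrative assumptions:

```python
# Sketch of an MLP with descending hidden-layer sizes and optional regularization.
import tensorflow as tf

def build_mlp(n_features, hidden=(128, 64, 32), l2=None, dropout=None):
    reg = tf.keras.regularizers.l2(l2) if l2 else None
    model = tf.keras.Sequential([tf.keras.Input(shape=(n_features,))])
    for width in hidden:                       # layer sizes in descending order
        model.add(tf.keras.layers.Dense(width, activation="tanh",
                                        kernel_regularizer=reg))
        if dropout:
            model.add(tf.keras.layers.Dropout(dropout))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Larger networks benefited from regularization, e.g.:
# mlp = build_mlp(20, hidden=(256, 128, 64, 32), l2=1e-4, dropout=0.2)
```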

Boosted Decision Tree
The current state-of-the-art algorithm used in high-energy physics searches is the boosted decision tree. The implementation evaluated in this work is from the scikit-learn 0.24.2 library 10. The AdaBoost algorithm 11 is used to boost a standard decision tree, with hyperparameters including a maximum depth and a maximum coverage.
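A minimal sketch of this setup with scikit-learn 0.24 is given below; the specific hyperparameter values are illustrative, not the tuned ones:

```python
# Sketch: AdaBoost-boosted decision trees with scikit-learn 0.24.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

bdt = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=3),  # maximum depth of each tree
    n_estimators=200,
    learning_rate=0.1,
)
# bdt.fit(x_train, y_train)
# scores = bdt.decision_function(x_test)  # continuous output usable for ordering events
```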